RESLVE: Leveraging User Interest to Improve 
Entity Disambiguation on Short Text 



Elizabeth L. Murnane 

Cornell University 

Information Science 

elnn236@cornell.edu 



Bernhard Haslhofer 

Cornell University 

Information Science 

bh392@cornell.edu 



Carl Lagoze 

University of Michigan 

School of Information 

clagoze@unnich.edu 



m 

o 

Oh 
< 

00 



o 



o 

(N 

^' 

o 
m 



^ 

cd 



ABSTRACT 

We address the Named Entity Disambiguation (NED) prob- 
lem for short, user-generated texts on the social Web. In 
such settings, the lack of linguistic features and sparse lex- 
ical context result in a high degree of ambiguity and sharp 
performance drops of nearly 50% in the accuracy of conven- 
tional NED systems. We handle these challenges by develop- 
ing a model of user-interest with respect to a personal knowl- 
edge context; and Wikipedia, a particularly well-established 
and reliable knowledge base, is used to instantiate the proce- 
dure. We conduct systematic evaluations using individuals' 
posts from Twitter, YouTube, and Flickr and demonstrate 
that our novel technique is able to achieve substantial per- 
formance gains beyond state-of-the-art NED methods. 

Categories and Subject Descriptors 

H.3.3 [Information Storage and Retrieval]: Informa- 
tion Search and Retrieval; 1.2.7 [Artificial Intelligence]: 
Natural Language Processing 

General Terms 

Algorithms, Design, Experimentation, Human Factors 

Keywords 

Entity Resolution; Social Web; Semantic Knowledge Graph; 
User Interest Modeling; Personalized IR 

1. INTRODUCTION 

Named Entity Recognition (NER) refers to the systematic 
process of identifying mentions of entities such as people, 
places, concepts, or ideas in unstructured text. The Named 
Entity Disambiguation (NED) problem arises when an en- 
tity has multiple candidate meanings, and it is particularly 
challenging when texts are short, linguistic features are un- 
reliable, and local lexical context is sparse. This kind of sce- 
nario is epitomized by the highly individualized text-based 
utterances found in online social media platforms, where the 
amount of such content is continually expanding and the 
need to extract information and knowledge from it grows. 

The following examples illustrate the difficulty in deter- 
mining the unique, intended sense of an ambiguous entity. 
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A tweet, posted on Twitter, the fastest growing micro-blog- 
ging service, where the amount of tweets produced in a day 
is equivalent to a 10 million page booljj 

aaahh one more day until flnnU! #cantwait 

A video title posted on YouTube, a video sharing website 
with over 800 million visitors each montlo 
the office holiday party 

A photo tag on Flickr, which hosts over 7 billion images, a 
figure nearly double the amount from just 4 years agqj 
Beetle 

According to their corresponding Wikipedia Disambiguation 
pages, these entities have dozens of candidate meanings. For 
instance, "finn" is both a popular TV character or a travel 
destination; "office" might refer to the concept of a work- 
place, an American television show, or a British television 
show; and "Beetle" could mean the animal, one of several 
vehicle models, a botanist, or a simple dice game. 

In each of these cases, the lexical context in which the am- 
biguous entity is contained does not help to definitively de- 
termine the user's intended meaning or is absent completely. 
Attempting to utilize prior tweets, videos, or photos posted 
by the user in the same social Web platform as a source of 
background knowledge is not an effective approach either, 
both because users do not generally post a large enough 
volume and because the ambiguity in those posts is often 
equally as high, offering no informative context 26 . 

Thus even though NED is a well-established problem with 
much extant research devoted to it, this short text writ- 
ten on social media is fundamentally different than the long 
and formal text on which traditional approaches are trained 
[13]; and consequently, unique identification of entity con- 
cepts is severely degraded for even state-of-the-art conven- 
tional methods [24]. For instance, the Fl score of the Stan- 
ford NER, which is trained on the CoNLL-Oqj news article 
dataset, drops from over 90% to just under 46% when ap- 
plied to a Twitter dataset 21 . 

With researchers beginning to recognize this, the area of 
entity recognition within social media has begun to draw 
some recent attention. Several strategies use crowd-sourcing 
techniques for linking entities, but these methods require a 
large number of reliable human workers 9, 10 . Automated 
approaches to NER have emerged, but they also face limi- 
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tations. Namely, most only handle entity extraction with- 
out addressing the disambiguation problem, and nearly all 
have only been trained to handle tweets, making it unclear 
whether results are generalizable to personal utterances out- 
side of Twitter 8, 18, 211 [sT]. 

Our work moves beyond these strategies and establishes 
individual-centric procedures to automatically disambiguate 
the short yet personally relevant text-based utterances a user 
makes on a variety of platforms. Our solution compensates 
for sparse lexical context with personal context, building a 
model of user interest from external structured semantic 
data. Essentially, we propose that an ambiguous entity's 
intended meaning is the candidate concept most similar to 
a user's core set of personal interests. 

We introduce a novel system called RESLVE (Resolving 
Entity Sense by LeVeraging Edits) that augments tradi- 
tional disambiguation techniques by implementing this per- 
sonalized approach. RESLVE relies on state-of-the-art ser- 
vices such as Wikipedia Miner 28 and DBPedia Spotlight 
[25| for extraction of entities and candidate meanings, and 
our improvements to the disambiguation phase can increase 
the overall effectiveness of such tools. The central contribu- 
tions of this paper are: 

• A model for representing user interest with respect to a 
knowledge base and its categorical organization scheme. 

• A ranking technique that takes as input a user interest 
model along with an ambiguous entity's set of candidate 
topics and outputs with improved accuracy the topic 
most likely intended. 

• An annotated dataset of disambiguated entities from 
Twitter, YouTube, and Flickr along with the results of 
an empirical evaluation of our system. 

The dataset and system implementation are publicly avail- 
able at https://github.com/emurnane/RESLVE 

The remainder of the paper is organized as follows. Sec- 
tion|2]presents the theoretical underpinnings of our approach 
to model user interest according to the text-based traces peo- 
ple leave on the Web. Section [3] details our implementation 
of the model as well as our ranking technique that measures 
the relevance of an ambiguous entity's candidate topics with 
respect to this model. Section [4] describes the experiment 
we performed to evaluate the method, offering insights and 
statistics about our datasets and details of data prepara- 
tion. Section |5] presents results, comparing our performance 
to baseline methods and NED services and identifying cases 
of efficacy and error. Finally, Section |6] offers concluding 
remarks and directions for future research. 



2. MODELING USER INTEREST 
2.1 Intuitions and Illustrations 

The underlying assumptions of our approach are that a 
user possesses a core set of interests, a user is more likely 
to mention an entity about a topic drawn from a domain 
of personal interest than from a domain of non-interest, and 
these interests can be formally modeled as a personal knowl- 
edge context. Thus by bridging user identity between social 
media (e.g.. Twitter, YouTube, or Flickr) and a knowledge 
base (e.g., Wikipedia), we propose to resolve the intended 
meaning of ambiguous entities encountered in the former by 



leveraging the power of structured information relevant to 
personal interests available in the latter. 

To aid understanding of these ideas, we present qualitative 
examples based on the Web traces of users in our dataset. 
Earlier, we gave an example of an ambiguous YouTube title: 
'Hhe office holiday party". Without seeing the video, even 
a human is unable to determine from title alone whether 
the video contains footage of a workplace, TV episode, or 
some other office-related subject matter; and extra detail 
like "episode 4" still does not aid manual disambiguation be- 
tween the US and UK shows. Inspecting the user's YouTube 
profile, we find other video titles mentioning this entity, such 
as 'Hhe office , december <5", but that information does not 
help in figuring out the meaning of the entity in question 
either. Examining the user's log of recent Wikipedia ed- 
its, however, allows us to resolve at once that the likely 
meaning is the US TV show: <item userid=xx user=xx 
pageid=31841130 title =The Office (U.S. season 8)/ > 

Next, a key assumption is that the interests expressed by 
a user on various Web platforms do not wander considerably 
among an incoherently diverse set of topics but rather are 
relatively consistent. To quantify this extent to which user 
interests overlap across platforms, we evaluate the coverage 
that Wikipedia categoric^ provide for topics mentioned in 
users' social Web utterances. We find that on average within 
our dataset, 54.2% of the entities a user mentions in so- 
cial media (e.g., "Java") have one or more of their candidate 
meanings covered by a direct category of an article that same 
user edited on Wikipedia (e.g., "ProgrammingJanguage") . 
Figure [l] provides an illustrative example of a representa- 
tive user's overlap of topics mentioned on both Twitter and 
Wikipedia. Each circle represents a topic with size propor- 
tional to the number of times the user uttered or edited that 
topic; and topics are grouped according to their proximity 
in Wikipedia's category hierarchy. For instance, a majority 
of this user's tweets and edited articles deal with informa- 
tion technology concepts, frequently overlapping on topics 
of programming languages, internet browsers, and security 
vulnerabilities. In fact, of the 178 entities this user mentions 
in tweets, 74.7% of them have a candidate meaning belong- 
ing to a direct category of an article she contributed to in 
Wikipedia (i.e., a category 1 edge away in the category hi- 
erarchy), 90.4% belong to a category 2 edges away, 98.9% to 
a category 3 edges away, and all 100% of the entities men- 
tioned in tweets are covered by moving only 4 edges up the 
category hierarchy starting at some article the user edited. 




Figure 1 : Example of the overlap in topics a user utters on 

Twitter (blue), on Wikipedia (purple), and on both 

platforms (red) 
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We now buttress these intuitions by looking both to theory 
and vaUdated experimental results from prior research. 

2.2 Theoretical Motivations 

Firstly, our work builds on social psychology theories that 
say a user possesses intrinsic interest in a key set of topics. 
Expressing these interests is personally fulfilling, motivating 
consumption and contribution of content about those top- 
ics to online communities [15 17 23], especially compared 
to random topics; for instance, people are more likely to 
read and respond to posts mentioning movies of personal 
interest than to random posts [11| |20| . In addition, prior 
research shows that a user's online contributions in social 
media and knowledge production communities are represen- 
tative of that individual's main topics of interest, and this 
research further establishes that these interests can be mod- 
eled according to lexical features of her text-based contribu- 
tions K| [6 , 29 . The next section describes how we represent 
user interest with respect to a general knowledge base. 

2.3 Modeling a Knowledge Context 

Put simply, a knowledge base is a semantic network used to 
organize entities, their types, and the relations among them. 
Modeling topics with respect to this structured information 
allows us to formally represent the topics in which a user 
is interested and the topics corresponding to an ambiguous 
entity's candidate meanings. In Section [3] we instantiate 
our devised model using Wikipedia. 

We begin with some definitions. First, we define a knowl- 
edge base K as a, directed graph K = (N,E), consisting of 
sets of nodes N and edges E, as illustrated in Figure |2] We 
distinguish between two types of nodes: 

Category nodes. Ncategory C N with each C G Ncategory 

having a unique identifier i ^ I and a set of semantic re- 
lationships r with other nodes, so that c = {i,R}. As an 
example, a category Car has a rel : broader relationship 
to category Vehicle, and both category Vehicle and category 
Elephant have rel : broader relationships to category Thing. 

Topic nodes. Nropic C N with each t G Nropic also having 
a unique identifier i ^ I, belonging to one or more categories, 
and carrying a textual description d so that t = {i,d,C} 
with C C Ncategory For cxamplc, Volkswagen Beetle is a 
topic belonging to category Car and having an associated 
descriptive document, which in the case of Wikipedia could 
be the text on the Volkswagen Beetle article page. 

With this definition of a knowledge graph in hand, we can 
build the interest model of a user u. To begin, we treat 
the user's text-based contributions to the description of a 




topic t as a signal of interest in that topic. For each such 
topic ti ^ Tu, where Tu C Nropic is the set of all topics a 
user has contributed to the description of, we construct a 
topic-interest graph rooted at node ti and consisting of all 
categories Cu ^ Ncategory that are reachable from ti by fol- 
lowing outgoing edges to parent categories in the knowledge 
graph K. (Note that we represent a candidate meaning of 
an ambiguous entity in the same way; the only difference is 
that a candidate is associated with the single topic that cap- 
tures its meaning, while the representation of a user contains 
each topic to which that user has contributed) . Formally, we 
represent user interest / in topic ti as: 

It, = {Nu,Eu), where Nu, =tiUCu and Eu, C E 
We ensure each Ut, is a bipartite graph by applying the 
following straightforward transformation. We initialize all 
edge weights in Uti to 1. Then, starting at node ti, we tra- 
verse each outgoing edge e(i^j) = {ra^nj) until we encounter 
an edge between two categories, that is an edge where both 
rii^rij ^ Cu, at which point we apply the transformation: 

a. Compute p = shortest path length between ti and Uj 

b. Remove the edge e(^,j) and replace it with an edge from 
ti to Uj that has an edge weight equal to - . If such an edge 
e(t.,n ) already exists, assign it a new weight equal to the 
greater of its current weight and - . Thus the interest model 
of user u is simply an aggregated set of all topic-interest 
graphs built from each topic the user has shown direct or in- 
direct interest in, formally defined as u = {Iti , 7^2 , • • • , h^}- 

To summarize, we now have a consolidated user-interest 
graph consisting of all topic nodes a user has contributed 
to, connected to category nodes by edges weighted with a 
value w,0 < w < 1. As this aggregation of multiple topic- 
interest graphs may result in a graph with duplicate cate- 
gories, we can eliminate duplicates by simply first removing 
any category node c having the same identifier as another 
category c in the graph and then taking any edge originally 
between a topic and c and redrawing it to c instead. To 
illustrate this process, let a user's set of contributions corre- 
spond to {ti , t2, ts} as shown in Figure[2] Applying the steps 
to transform the graph into a bipartite graph and removing 
duplicate category nodes results in Figure [3] 

It is also possible at this point to represent the user inter- 
est graph as a topic-category matrix, which is the analogue 
of a term-document matrix that represents documents in a 
vector space model. Figure [4] illustrates such a matrix built 
from the topic-interest graph shown in Figure |3] Each row 
corresponds to a vector for topic t^ , each column corresponds 
to a category c ^ Cu, and each topic-category entry equals 
the weight of the edge between those two nodes in the user's 
interest graph or if no such edge exists. 

Having provided a walkthrough of our theoretical motiva- 
tions and a method for constructing a model of user interest. 




Figure 2: Knowledge graph of categories, topics, and 
descriptions 



Figure 3: User topic-interest graph 
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Table 2: Edits and text that are not concept-bearing 



1/2 
Figure 4: Edge- weight matrix of user interest graph 



we next describe the full details of our implemented system. 
We could instantiate our model on any knowledge base with 
the above structure such as Wikipedia, DBPedia, and Free- 
base, which are all high coverage knowledge bases employed 
as sources of entity and concept representations for Twitter 
[16| , YouTube 2 , and Flickr 32 . For this paper, we imple- 
ment on Wikipedia for reasons described in Section [3?!] 

3. THE RESLVE SYSTEM 

To determine the intended meaning of an ambiguous en- 
tity detected in a user's short text, we compute the relevance 
of each candidate topic to which an entity could refer to the 
user's interest model, which is based on the topics that user 
contributed to in a knowledge base, for example Wikipedia. 

3.1 Wikipedia as a Knowledge Context 

Wikipedia is both popular and powerful, capable of fa- 
cilitating entity recognition, linking, and disambiguation |3, 
[7, 24, 27, 28 . For this paper, we implement our model us- 
ing Wikipedia both because it has been well established as 
a rich source of external information and because it offers 
additional advantages particularly relevant for our task. 

First, Wikipedia has been shown effective in modeling user 
interests since editing behavior on the site serves as an indi- 
cator of interest and because the site's organizational struc- 
ture provides a way to formally represent those interests. 
Specifically, Wikipedia editors seek out articles about top- 
ics of personal interest, revision histories catalog these top- 
ics along with valuable metadata, and resources like article 
pages and category graphs effectively represent the topics 1 6 , 
[19, 33, 34, 36 . In addition, research analyzing social media 
shows that users frequently mention topics of personal in- 
terest in tweets and that Wikipedia's category structure can 
be used to represent those topics [26]. Also, Wikipedia pro- 
vides broad coverage of domain-independent named entity 
concepts and rare word senses 



18) 37 , which is key given 



the diversity of topics users talk about on the social Web. 

Finally, article-editing behavior lends itself well to for- 
mulating the strength of a user's interest in a given topic. 
Table [l] summarizes contribution characteristics that are in- 

Table 1 : Editing behaviors indicative of user interest 
Editing Behavior Intuition 

Number of times Repeated editing implies greater 
user edits article investment and interest 



Type of edit 



Trivial edits (see Table [2| are 
weaker signal of interest 



Article's global edit 
activity and number 
of editors 



Generally popular articles are 
less discriminative of individual 
interest and personal relevance 



Editing time span 



Long-term interests are stronger 
than fleeting ones 



Edit quality w.r.t. 
Info. Qual. metrics 



Substantiveness and quality in- 
dicate concern in topic 



Edits Ignored: 

• Trivial (typo fixes, 
vandalism reverts) 

• Articles with under 

100 non-stopwords 



Patterns Cleaned: 

• Stopwords, punct. removed 

• Article maintenance info in 
Wiki Markup removed 

• Stem, tokenize, lowercase 



dicative of interest and can be incorporated into a weighting 
formula, along with underlying intuitions as to why such be- 
havior is meaningful. For this experiment, we factor in only 
the number of edits and the edit type. 

Some work that Wikipedia users perform, however, is not 
indicative of personal interest '6' and is therefore filtered 
out when constructing the interest model. The left column 
of Table |2] summarizes such irrelevant edits, and the table's 
right column lists the further processing we perform to clean 
all article text. 

To reiterate, our approach simply takes each Wikipedia 
article a user has made non-trivial contributions to and con- 
structs an interest model according to the specification de- 
scribed in Section [2. 3| That is, we treat articles as the top- 
ics comprising Nropic, use Wikipedia "Category:" resources 
to represent the categories in Ncategory, and consider the 
text content on an article page to be that topic's unique 
description. 

We now move on to describe the modules in our sys- 
tem framework, RESLVE, that addresses the disambigua- 
tion problem by (I) connecting a user's social Web identity 
and Wikipedia editor identity, (II) modeling that user's per- 
sonal interests using her articles edited on Wikipedia, and 
(III) ranking entity candidates by measuring how similar 
each candidate's associated topic is to the most salient top- 
ics in the user's interest model. Throughout our discussion, 
we refer to Figure [5] which illustrates each of those phases. 

3.2 Phase I: Bridging Web Identities 

The first step is bridging the user's social Web identity 
with that same user's identity in the context of a knowledge 
base. Given an ambiguous text on social media, we detect 
the Wikipedia account that belongs to the same person to 
leverage its record of contributions as structured personal 
context that can be used to formally model interests. Our 
current approach is simple string matching of account user- 
names since prior research demonstrates feasibility [14[ |30| . 
See Section |4] for validation of the technique. 

3.3 Phase II: Representing Users and Entities 

RESLVE next ranks an ambiguous entity's candidate top- 
ics by measuring how relevant each one is to the user's inter- 
est model built from the bridged account. RESL VE ta kes 
advantage of both a topic's description (Section |3.3.1 ) as 
well as its semantic relationships in the knowledge-graph 
(Section 1 3.3.2 ) in order to get an overall measure of relat- 
(Se 



edness (Section 3.4) between the user's interests and each 



candidate meaning. We represent the user u and the candi- 
date meaning m as weighted vectors in order to use classic 
information-retrieval techniques to measure their similarity. 

3. 3. 1 Content-Based Similarity 

To measure content-based similarity, we use a bag-of- words 
approach, building both for u and for m a TF-IDF weighted 
term vector from the titles of the articles a user has edited, 
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Figure 5: The RESLVE system performs disambiguation by (I) connecting a user's social web identity and Wikipedia editor 

identity, (II) modeling personal interests in terms of topics associated with the articles the user edited on Wikipedia, and 

(III) ranking entity candidates by measuring how similar each candidate's topic is to the user's interest model 



the candidate's article title, the tokenized words from all 
those articles' pages, and the titles of those articles' cate- 
gories, sirricontent is the cosine-similarity measurement be- 
tween the user vector Vcontent,u and the candidate meaning 
vector Vcontent,m- Givcu a total of 7 terms extracted from 
articles' titles, descriptions, and categories, we have: 

Vcontent,u = {tfidf{u, tl), . . . , tfidf{u, tj)} 

Vcontent,m = {tfidf{m, tl), . . . , tfidf{m, t^)} 

SZUlcontentyU, 171) ^^ COS SlTTlyV content, u-, VcontentjTn ) 

3.3.2 Knowledge-Context Based Similarity 

To measure relevance based on semantic relationships from 
the knowledge graph, we build vectors now using articles' 
category IDs. Vcategory = {wgraph{categoryi) \ categoryi G 
(Cti U Cm), Cii, Cm ^ N}, where each categoryi is a cate- 
gory with an edge to an interest topic or a candidate topic. 
To consider not only occurrence but position of a category 
in the knowledge graph, the function w graph measures a cat- 
egory's "semantic relevance" to a topic, denoted dist{c), as 
the edge weight between the topic and category (the shortest 
path length as explained in Section 2.3 ). This scheme assigns 
more weight to a close and directly relevant category of a 
topic such as "American Television Series" than to a category 
far away and too general such as "Broadcasting". We also 
do factor in the occurrence of a category, denoted freq{c), 
which is the number of user- or candidate-relevant articles 
the category has an edge with in the knowledge graph. For a 
candidate, this value is either or 1 depending on whether 
the category is present or absent in the hierarchy of cate- 
gories originating from the candidate's corresponding arti- 
cle. Formally the weighting formula is defined as: 



WgraphyCi^, U) 
Wgraph{Ci,m) 



dist{ci) * freq{ci), 


■ freq{ci), 



eCu 

^ Cu 

G Cm 
^ Cm 



dist{ci 



where Cu , Cm ^ N 
The semantic relevance between u and m is then: 

SlTTlcategory\U^ TTlJ — COSSlTTlyV category ,ut V category ,m) 

3.4 Phase III: Ranking by Personal Relevance 

All candidate meanings for an ambiguous entity are scored 
using the composite formula sim{u, m) = a^sim content {u, m) 
+ (1 — a) * sirricategoryiu^m) where a is a weighting pa- 



rameter determined experimentally. RESLVE outputs the 
highest scoring candidate as the user's intended meaning. 

Given that Wikipedia has millions of articles and hun- 
dreds of thousands of categories, it is necessary to avoid 
vector dimensionalities that make computation impractical. 
For the future we will explore a pruning strategy that re- 
quires new vector components to meet a "relatedness" thresh- 
old to the current vector in addition to using standard strate- 
gies that remove very high and low frequency terms. 

4. EXPERIMENTS 

4.1 Data Collection and Preparation 

We collected tweets posted on Twitter, videos posted on 
YouTube, and photos posted on Flickr. After eliminating 
users whose content was not written in English and the ac- 
counts that had since been deleted or had public permissions 
removed, the remaining usernames were fed to the identity 
module (Module I in Figure|5]), which output usernames that 
existed on Wikipedia too. We then employed annotators 
to confirm each bridged account belonged to one individual 
person in order to both assess the effectiveness of our string- 
matching approach (see Section 3.2) and to ensure that the 



accounts used in the experiment were not false positives that 
would pollute results. Section |5.4.2| summarizes the anno- 
tator judgments in Table [S] and offers solutions to overcome 
any low username-reuse encountered. 

Next, guided by thresholds used in recent research [12] 
22 38 , accounts with less than 100 social Web utterances 



or 100 lifetime Wikipedia edits were removed as inactive. We 
downloaded the remaining users' most recent 100 short text 
utterances on each social Web platform - specifically, their 
tweets, YouTube video titles and descriptions, and Flickr 
photo tags, titles, and descriptions. We also collected the 
ID, title, page content, and categories for every article the 
user edited on Wikipedia in order to build a comprehensive 
personal context. We did not consider interest drift here, 
but we have future plans to investigate whether temporal 
attributes of a user's utterances and edits affect the observ- 
able interest overlap. 

We passed the short texts to the pre-processing module 
of Figure [5] to be cleaned and filtered in the ways listed in 
Table |3] RESLVE then extracted entities from the short 
texts using Wikipedia Miner and DBPedia Spotlight with 



Table 3: Short text pre-processing and normalization 
Tweets: • Normalize ©name to MENTION 

• Remove RT (retweet) tag 

• Remove leading "#" but keep hash tag's 
target concept if English word 



YouTube, 
Flickr: 



All utter- 
ances: 



• Bypass auto-generated file names like 
IMG_336.jpg or MOV_02.AVI 

• Remove file type suffix, e.g. ".png", but 
leave file name if an English word 

• Ignore auto-generated tags, e.g., "hid- 
den:filter=Boost" machine-tag on Flickr 

• Remove URLs 

• Remove non-English 



disambiguation threshold parameters set to in order to 
retrieve all potential candidate meanings. The filters in Ta- 
ble |4] removed invalid entities, and we gave the remaining 
1545 valid entities to Mechanical Turkers tasked with label- 
ing whether each candidate provided by the NER services 
was the correct meaning or not. Three different annotators 
labeled each candidate, and we required workers to be Cat- 
egorization Master^ or have at least 95% approved hits in 
prior tasks. We removed entities for which Turkers did not 
unanimously select a correct sense, leaving 918 labeled am- 
biguous entities. In the final step, ambiguous entities were 
passed to module III (see Figure [sl, which ranks candidates 
according to their similarity with the user interest model and 
outputs the most likely intended meaning. We give descrip- 
tive statistics about the texts and entities in Sectio nJ4.2| and 
we present RESLVE's ranking precision in Section [5] 

4.2 Dataset Characteristics 

Here we report on interesting attributes and general trends 
found in our dataset. Figure |6] shows the distributions of 
text length (number of characters) for utterances on Twitter, 
YouTube, and Flickr as well as for content in Reuters-2157qj 
and Brown- Corpu^ collections. The last two are corpora on 
which standard NED tools are trained and tested. Apparent 
is the very short nature of social Web posts, especially photo 
tags, photo titles, and video titles; and the longest texts a 
user writes on the social Web are still generally shorter than 
even the shortest texts from traditional NER task corpora. 
The character limits imposed on these social platforms also 
stand out as peaks in the graph. Tweets can be at most 140 
characters: Flickr has a 500-character limit: and YouTube 



Table 4: Filters applied to entities 

Language • Non-English 

based: • Single characters and parse errors 

Entity • Non-entities, i.e., detected terms that are 

based: not a Noun class (NN, NNS, NNP, NP) or 

Named Entity class (e.g., location, per- 
son, organization) according to named 
entity corpora lEER, ACE, or CoNLL 
• Non-ambiguous entities (0 or 1 meaning) 



https : //www . mturk . coin/mturk/help?helpPage=worker| 
^http://kdd. ics .uci.edu/databases/reuters21578/ 

reuters21578.html 
^ http : //www . hit . uib . no/icame/brown/bcm . html 
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Figure 6: Freq (y) of chars (x) in various corpora 



allows 1000 characters for video descriptions but only 250 
show in a preview area, so users' self-limiting causes a no- 
ticeable peak there. 

Furthermore, we see the occurrence of entities with mul- 
tiple meanings is high. Table [5] summarizes the proportions 
of tweets, titles, descriptions, and tags that contain at least 
one ambiguous entity as well as the percentage of detected 
entities that are ambiguous. Across all 3 social Web sites, 
91% of users' content contain one or multiple ambiguous en- 
tities; nearly two-thirds of all detected entities on these sites 
are ambiguous; and very few entity-containing short texts 
contain only non-ambiguous entities. 

Table 5: Ambiguity on social Web. (a): texts containing 
ambiguous entities; (b): entities with ambiguous sense 



Twitter YouTube 



Flickr 




Title Desc Tag 

92% 97% 77% 
66% 44% 73% 



A final observation is that not only do ambiguous enti- 
ties appear frequently, but the degree of their ambiguity is 
high as well. Wikipedia Miner and DBPedia Spotlight as- 
sign each candidate meaning a probability score indicative of 
the likelihood it is the correct sense. Figure |7| shows the dis- 
tribution of this score for each top ranked candidate. We see 
the confidence the services are able to assign their selected 
candidates is very low, signaling that unreliable knowledge 
underlies the ranking. The confidence scores of many cor- 
rect candidates actually fall below the default threshold and 
would erroneously not even be retrieved. Also, not only do 
established techniques have difficulty selecting a candidate 
with high probability, but there are often many candidates 
from which to choose (in our dataset, this number ranges 
from 2 to 163 with an average of 5-6 and median of 4). All 
together, this makes the NER services' candidate selection 
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Figure 7: Confidence scores for top candidate 



akin to a random guess with low chance of choosing the 
actual intended meaning. 

Overall, this analysis confirms entity ambiguity in short 
texts is a profuse, difficult problem requiring attention. 

5. RESULTS 

In this section, we detail the evaluation and performance 
analysis we conducted in order to compare the correctness of 
our ranking algorithms to a gold standard of human annota- 
tor judgments, a number of baseline measures, and existing 
state-of-the-art techniques. We evaluated our user-centric 
approach with short texts drawn from three separate sources 
(Twitter, YouTube, and Flickr) since inspection of posts on 
these sites revealed qualitative differences. Table [6] reports 
performance as precision at rank 1 (P@l), or the fraction 
of evaluated entities for which the top ranked candidate is 
the correct meaning; here this measure is the same as recall 
since we assume the NER services provide us every potential 
candidate sense. 

5.1 Human Annotated Ceiling 

As explained in Section [4!T] in order to obtain labeled data 
for an ambiguous entity, we used Mechanical Turk to deter- 
mine whether or not a candidate was the correct meaning 
of an ambiguous entity. We see an average observed agree- 
ment across all coders and items of 0.866 and average Fleiss 
Kappa=0.803. Both are within generally acceptable ranges 
for this type of task, indicating it is possible to disambiguate 
these texts manually and feasible to try it mechanically. 

5.2 Comparison to Baselines 

• Random Candidate (RC) ranks in random order 

• Prior Frequency (PF) ranks candidates according to 
their commonness, i.e., their prior frequency measure 

• RESLVE with Random-User (RU) applies our method 
with a random Wikipedia user's interest model as input 
rather than that of the user who uttered the entity 

• Wikipedia Miner (WM) and DBPedia Spothght (DS) 
are established NED techniques 

5.3 Impact of Text Nature and Length 

Evaluation shows that incorporating background knowl- 
edge about user-specific interests can outperform traditional 
strategies that rely on prior word frequencies or linguistic re- 
lationships among words in the local context. In particular, 
RESLVE performs best on YouTube, the longest texts in 
our dataset (see Figure [6| , mainly because of content-based 
similarity with the user-interest model. It also outperforms 
existing NER services on Twitter texts, which are generally 

Table 6: Precision (P@l) of ranking method s 



Table 7: Errors made by RESLVE 



Example 



Reason 





Flickr 


Twitter 


YouTube 


RESLVE 


0.63 


0.76 


0.84 


RC 


0.21 


0.32 


0.31 


PF 


0.74 


0.69 


0.66 


RU 


0.51 


0.71 


0.78 


WM 


0.78 


0.58 


0.80 


DS 


0.53 


0.67 


0.63 



''Peter on Referent entity not in Wikipedia, so 
the dock" overlap with user model impossible 

"I uploaded RESLVE assumes every entity is 
a video on user-specific and always selects can- 
@youtube" didate with most user model over- 
lap, here ranking "1945 European 
Films" highest for "video" for a user 
who edits European cinema pages 

RESLVE Performance depends on amount of 

assigns content user produces in the knowl- 

correct edge base and drops with users who 

candidate edit shorter articles or make fewer 

a 0.0 score contributions to Wikipedia overall 



more personal than those on Flickr or YouTube and there- 
fore less often simply refer to the most common sense. This 
shows that considering user interest for NED can be effective 
in highly user-centric domains. 

Along the same lines, RESLVE with a random user's in- 
terest model as input can perform better than baselines since 
incorporating external data allows more topic overlap with 
candidate entities; but it is not as accurate as with the per- 
sonalized input, showing that user-specific data does help. 

Conversely, RESLVE is less effective on more impersonal 
text. Misrankings result from automated posts; and we see 
lower performance on Flickr data, where many entities re- 
fer to non- subjective topics (e.g., geographic places), which 
have high prior frequencies and can be ably resolved with 
traditional approaches. Also, examination reveals many of 
these mentioned places are non-local, non- familiar travel lo- 
cations and therefore not covered by a user's knowledge base 
contributions about acquainted topics. Table [7| and the next 
section explain additional error cases. 

5.4 Error Analysis and Future Work 

5.4.1 Incorrect Candidate Selection 

Analyzing RESLVE 's failure points reveals reasons for rank- 
ing mistakes and ideas for future improvement. 

5.4.2 User Identity 

The identity module (see Figure [51 simply performs string 
matching to determine whether a username exists on both a 
social media site and on Wikipedia, and Table [S] summarizes 
the results from our dataset. Comparing public profile in- 
formation between the social media accounts and Wikipedia 
accounts that do exist and share the same username, annota- 
tors unanimously found that Twitter- Wikipedia, YouTube- 
Wikipedia, and Flickr- Wikipedia username matches actu- 
ally belong to the same person in approximately 47%, 48%, 
and 71% of cases, respectively. In the remaining cases, an- 





Table 8: Usernames reused 




# Usernames 


Exist on Wikipedia 


Twitter 


479 


46.1% 


YouTube 


454 


19.6% 


Flickr 


226 


21.7% 



notators were generally unable to make any determination 
about user identity rather than confirm a username belonged 
to different people. While our simple approach succeeds in 
more cases than it fails, it does face limitations, and the rel- 
atively high costs of determining confirmed cross-platform 
user identities limits the sample size of our current evalu- 
ation. Improvement of this module is our first priority for 
future research. We identify 3 main sources of errors: 

1. False positives: String matching succeeds in retriev- 
ing accounts that exist on both the social Web and 
Wikipedia with the same username, but these accounts 
do not belong to the same individual person. 

2. True negatives: The individual who participates on the 
social Web does not hold an account on Wikipedia. 

3. False negatives: Accounts exist, but a string matching 
approach fails because they have different user names. 

To address the last case where a user contributed to a 
knowledge base but under a different username, we propose 
exploring a model that considers a larger array of fully and 
partially matching profile information. Recent work finds 
profile attributes such as name, email address, or hometown 
to be highly reliable identifiers [l] |4] [35] . 

To address the opposite case where the user has not con- 
tributed to a knowledge base, we propose collaborative fil- 
tering techniques to use the contribution histories of social 
connections (people a user friends, follows, or favorites on 
the social Web) as an approximation of the user's own in- 
terests. Extending the implementation to other knowledge 
bases besides Wikipedia is also possible, as is modeling user 
interest not only from contributions like article edits but 
also from other forms of participation such as page visits, 
favoriting, or bookmarking. 

6. CONCLUSION 

We addressed the Named Entity Disambiguation problem 
with a user-interest centered approach. We showed that a 
user tends to produce content within a scope of topics of per- 
sonal interest across multiple online platforms and that it is 
possible to formally represent these topics using structured 
semantic data that can serve as a personal knowledge con- 
text. We introduced a representation of user interest given 
any general knowledge base and implemented the model on 
Wikipedia as a popular and powerful instance. 

Our approach to the NED problem does not depend on 
local or language-specific information, which is often hard 
to process, unreliable, or missing entirely in user-generated 
content. We reveal the advantages of our strategy over a 
variety of baseline and state-of-the-art methods and achieve 
improvements in performance, especially when text contains 
content of a highly personal nature. 

As the Web and particularly the social Web evolves, this 
kind of content will only continue to extensively grow, and 
user profiles will become ever more inter-connected across 
domains. Not only will this compel innovative solutions 
to handle existing information retrieval tasks, but as these 
tasks become increasingly difficult, it will make the solutions 
introduced in this paper all the more feasible and effective. 
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